Simulation-Based Optimization Of Markov Controlled Processes With Unknown Parameters

Author

  • Enrique Campos-Náñez
Abstract

We consider simulation-based gradient estimation and its use in Markov controlled processes with unknown parameters. We consider a Markov reward process controlled by both a set of tunable parameters and a set of fixed but unknown parameters. We analyze the use of recursive identification procedures and their application to existing simulation-based gradient algorithms. We show that simple modifications of available gradient estimation algorithms, namely assuming parameter certainty, can accommodate system parameter identification without sacrificing the convergence of these algorithms to local optima, by following a two-time-scale recursive identification/optimization procedure. This approach is illustrated through an application to the algorithm proposed in (Marbach and Tsitsiklis, 2001). We illustrate our results with a small numerical example, which we further use to test the ability of the proposed scheme to track slowly changing system parameters.

INTRODUCTION

Many stochastic resource allocation and control problems can be naturally modeled using dynamic programming. Unfortunately, the dynamic programming algorithm suffers from the ‘curse of dimensionality’. To address these performance problems, researchers have resorted to parameterized control policies, as in (Marbach and Tsitsiklis, 2001), reducing the problem to one of search in policy space. Several simulation-based techniques exist (Marbach and Tsitsiklis, 2001, 2003; Cao and Chen, 1997; Cao and Wan, 1998; Fang and Cao, 2004; Baxter and Bartlett, 2001) that use realizations of the stochastic processes to obtain gradient estimates that can be used in a gradient-based search method in policy space. Many of the algorithms mentioned above are ‘model-free’ and, hence, implicitly adaptive, e.g., general stochastic approximations (Kiefer and Wolfowitz, 1952; Kushner and Yin, 1997), but they exhibit slow convergence due to biased gradient estimates or large variances.
This problem can be alleviated by incorporating knowledge of the system model, for example by exploiting regenerative structure to eliminate bias. However, because such techniques require knowledge of all system parameters, their usefulness is limited.

Although the problem of adaptive control of Markov chains has been studied before, most authors address the case of a finite number of possible models, as in (Mandl, 1974; Borkar and Varaiya, 1979; Kumar and Becker, 1982; Doshi, 1980; El-Fattah, 1981). Other authors overcome this restriction but assume that the optimal policies for each model can be easily computed, such as in (Borkar and Varaiya, 1982). Others embed the estimation process in a value iteration procedure (Hernández-Lerma, 1989), which is not computationally efficient when the state spaces are large. The work in (Ren and Krogh, 2001; Santharam and Sastry, 1997) estimates Q-factors, a technique that is adaptive since no system model is in principle needed, but that requires the action set to be finite in order to build a Q-factor for each state-action pair, and quickly becomes impractical as the state space increases. In some specific cases, the Q-factors can be approximated via a neural network or other architectures (Bertsekas and Tsitsiklis, 1996), with limited success.

In this paper we consider the problem of adaptive optimization of parameterized Markov reward processes. We first study the problem of online estimation and optimization of the average reward criterion in Markov reward processes for which the transition probabilities, as well as the expected reward per stage, are functions of two sets of parameters: 1) a set of tunable controls, and 2) a fixed but unknown parameter.
We study estimation procedures that can be updated after each transition in the chain, generalizing some of the results of (Campos-Nanez and Patek, 2005), and apply them to existing simulation-based algorithms, such as the algorithm of (Marbach and Tsitsiklis, 2001), providing sufficient conditions for the convergence of the resulting method to a local optimum.

The paper is organized as follows. We first present the problem of adaptive Markov control, and address the problem of estimation of an unknown parameter. We follow this with a discussion about the use of estimators in optimization algorithms, and illustrate our approach using an algorithm developed in (Marbach and Tsitsiklis, 2001). We provide sufficient conditions for the convergence of the adaptive scheme to local optima, while identifying the true system parameter values. The resulting adaptive algorithm has small memory and computational requirements, and can hence be implemented in an online fashion. Its performance is discussed in the last section of the paper, where its usability under slowly changing parameters is explored.

Proceedings 23rd European Conference on Modelling and Simulation ©ECMS. Javier Otamendi, Andrzej Bargiela, José Luis Montes, Luis Miguel Doncel Pedrera (Editors). ISBN: 978-0-9553018-8-9 / ISBN: 978-0-9553018-9-6 (CD)

MARKOV REWARD PROCESSES WITH UNKNOWN PARAMETERS

Let $\{i_n\}_n$ be a discrete-time Markov chain with finite state space $S = \{1, 2, \ldots, N\}$. There is a set of real-valued ‘tunable’ parameters $u$, and a fixed but unknown parameter $\theta^* \in \Theta$, where $\Theta$ is a convex, compact set of real parameter values. These two sets of parameters determine the dynamics of the Markov chain in the sense that the transition probabilities $p_{ij}(u,\theta^*) = P(i_{n+1} = j \mid i_n = i, u, \theta^*)$ are functions of the control vector $u$ and the parameter $\theta^*$. We define $P(u,\theta^*)$ to be the transition probability matrix with entries $p_{ij}(u,\theta^*)$.
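As a rough illustration of an estimate that is updated after each transition, the sketch below simulates a hypothetical two-state chain and refines an estimate of the unknown parameter every time state 0 is left. The transition law (p_01 = theta * u, p_10 = 0.5), the 1/n step size, the interval standing in for the compact set Θ, the zero-based state labels, and all function names are assumptions made for this example; they are not taken from the paper, whose estimator may differ.

```python
import random


def transition_matrix(u, theta):
    """Hypothetical P(u, theta) for a two-state chain (states 0 and 1)."""
    p01 = theta * u  # assumed dependence on the control and the parameter
    return [[1.0 - p01, p01],
            [0.5, 0.5]]


def project(x, lo, hi):
    """Project x onto the interval [lo, hi], a stand-in for the set Theta."""
    return min(max(x, lo), hi)


def identify_theta(u, theta_true, steps, seed=0, theta_lo=0.05, theta_hi=0.95):
    """Recursively estimate theta from observed transitions out of state 0."""
    rng = random.Random(seed)
    P = transition_matrix(u, theta_true)  # used only to generate the sample path
    state = 0
    theta_hat = 0.5  # arbitrary initial guess inside [theta_lo, theta_hi]
    visits = 0
    for _ in range(steps):
        next_state = 1 if rng.random() < P[state][1] else 0
        if state == 0:
            visits += 1
            # The jump indicator divided by u has mean theta under the assumed law.
            observation = (1.0 if next_state == 1 else 0.0) / u
            theta_hat = project(
                theta_hat + (observation - theta_hat) / visits,
                theta_lo, theta_hi,
            )
        state = next_state
    return theta_hat


theta_hat = identify_theta(u=0.8, theta_true=0.6, steps=20000)
print(f"estimated theta: {theta_hat:.3f}")
```

The update uses only the most recent transition and a visit counter, so its memory footprint stays constant along the sample path; the estimate should settle near the true value 0.6 used to generate the data.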
The reward observed by this system is also a function of the parameters $u$ and $\theta^*$, i.e., the expected reward per stage when the system is at state $i$ and controls $u$ are applied is a function of $u$ and $\theta^*$, denoted by $g_i(u,\theta^*)$. We focus our attention on the average reward criterion, which can be defined as $\lambda(u,\theta^*) = \liminf_{T\to\infty} \frac{1}{T}\, E\Big[ \sum_{n=0}^{T} g_{i_n}(u,\theta^*) \,\Big|\, \cdots$
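To make the average reward criterion concrete, the following sketch compares a single-path simulation estimate with the value obtained from the stationary distribution of a hypothetical two-state chain. The transition law, the reward function g_i(u, theta), the zero-based state labels, and all names are assumptions chosen for illustration only, and the closed-form stationary distribution applies to two-state chains specifically.

```python
import random


def transition_matrix(u, theta):
    """Hypothetical P(u, theta) for a two-state chain (states 0 and 1)."""
    p01 = theta * u
    return [[1.0 - p01, p01],
            [0.5, 0.5]]


def rewards(u, theta):
    """Hypothetical expected reward per stage g_i(u, theta) for i = 0, 1."""
    return [1.0 - u, 2.0 * theta]


def average_reward_exact(u, theta):
    """Average reward as the stationary-distribution-weighted reward."""
    P = transition_matrix(u, theta)
    g = rewards(u, theta)
    a, b = P[0][1], P[1][0]
    pi = [b / (a + b), a / (a + b)]  # stationary distribution of a two-state chain
    return pi[0] * g[0] + pi[1] * g[1]


def average_reward_simulated(u, theta, horizon, seed=0):
    """Accumulate rewards for n = 0..horizon along one path, divide by horizon."""
    rng = random.Random(seed)
    P = transition_matrix(u, theta)
    g = rewards(u, theta)
    state = 0
    total = 0.0
    for _ in range(horizon + 1):
        total += g[state]
        state = 1 if rng.random() < P[state][1] else 0
    return total / horizon


exact = average_reward_exact(0.8, 0.6)
simulated = average_reward_simulated(0.8, 0.6, horizon=200000)
print(f"exact: {exact:.4f}, simulated: {simulated:.4f}")
```

For this ergodic example the long-run time average along one path and the stationary expectation should agree closely, which is what makes simulation-based estimates of the criterion usable.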


Similar resources

Designing an Economic Repetitive Sampling Plan in the Presence of Two Markets

In this paper, we develop an optimization model for the economic design of repetitive sampling plan in the presence of two markets. The process under consideration produces a product with a normally distributed quality characteristic with unknown mean and known variance. The quality characteristic has a lower specification limit. The quality of the product is controlled via lot-by-lot acceptanc...


Simulation-Based Optimization of Markov Reward Processes: Implementation Issues

We consider discrete time, finite state space Markov reward processes which depend on a set of parameters. Previously, we proposed a simulation-based methodology to tune the parameters to optimize the average reward. The resulting algorithms converge with probability 1, but may have a high variance. Here we propose two approaches to reduce the variance, which however introduce a new bias into t...


Simulation-Based Optimization of Markov

We propose a simulation-based algorithm for optimizing the average reward in a Markov Reward Process that depends on a set of parameters. As a special case, the method applies to Markov Decision Processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented on-line. A convergence result (with ...


RELIABILITY–BASED DESIGN OPTIMIZATION OF CONCRETE GRAVITY DAMS USING SUBSET SIMULATION

The paper deals with the reliability–based design optimization (RBDO) of concrete gravity dams subjected to earthquake load using subset simulation. The optimization problem is formulated such that the optimal shape of concrete gravity dam described by a number of variables is found by minimizing the total cost of concrete gravity dam for the given target reliability. In order to achieve this p...


Estimating the Parameters in Photovoltaic Modules: A Constrained Optimization Approach

This paper presents a novel identification technique for estimation of unknown parameters in photovoltaic (PV) systems. A single diode model is considered for the PV system, which consists of five unknown parameters. Using information of standard test condition (STC), three unknown parameters are written as functions of the other two parameters in a reduced model. An objective function and ...




Publication date: 2009